Goto

Collaborating Authors

 pathology report


PathReasoning: A multimodal reasoning agent for query-based ROI navigation on whole-slide images

Zhang, Kunpeng, Xu, Hanwen, Wang, Sheng

arXiv.org Artificial Intelligence

Deciphering tumor microenvironment from Whole Slide Images (WSIs) is intriguing as it is key to cancer diagnosis, prognosis and treatment response. While these gigapixel images on one hand offer a comprehensive portrait of cancer, on the other hand, the extremely large size, as much as more than 10 billion pixels, make it challenging and time-consuming to navigate to corresponding regions to support diverse clinical inspection. Inspired by pathologists who conducted navigation on WSIs with a combination of sampling, reasoning and self-reflection, we proposed "PathReasoning", a multi-modal reasoning agent that iteratively navigates across WSIs through multiple rounds of reasoning and refinements. Specifically, starting with randomly sampled candidate regions, PathReasoning reviews current selections with self-reflection, reasoning over the correspondence between visual observations and clinical questions, and concludes by proposing new regions to explore. Across rounds, PathReasoning builds a reasoning chain that gradually directs attention to diagnostically relevant areas. PathReasoning turns each whole slide into a sequence of question-guided views, allowing the model to efficiently find informative ROIs within a fixed number of steps, without the need for dense pixel-level annotations. PathReasoning can substantially outperform strong ROI-selection approaches by 6.7% and 3.1% of AUROC on subtyping and longitudinal analysis tasks. The high-quality ROIs further support accurate report generation on breast cancer, significantly outperforming the standard GPT-4o by 10% in accuracy. PathReasoning prioritizes question-specific regions and constructs interpretable reasoning chains, supporting efficient slide review, consistent diagnostic interpretations, comprehensive reporting, and evidence traceability in digital pathology.


Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages

Spaanderman, Douwe J., Prathaban, Karthik, Zelina, Petr, Mouheb, Kaouther, Hejtmánek, Lukáš, Marzetti, Matthew, Schurink, Antonius W., Chan, Damian, Niemantsverdriet, Ruben, Hartmann, Frederik, Qian, Zhen, Thomeer, Maarten G. J., Holub, Petr, Akram, Farhan, Wolters, Frank J., Vernooij, Meike W., Verhoef, Cornelis, Bron, Esther E., Nováček, Vít, Grünhagen, Dirk J., Niessen, Wiro J., Starmans, Martijn P. A., Klein, Stefan

arXiv.org Artificial Intelligence

Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records, but prior work often focuses on single tasks, limited models, and English-language reports. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases, colorectal liver metastases, liver tumours, neurodegenerative diseases, soft-tissue tumours, melanomas, and sarcomas, at three institutes in the Netherlands, UK, and Czech Republic. Models included general-purpose and medical-specialised LLMs of various sizes, and six prompting strategies were compared: zero-shot, one-shot, few-shot, chain-of-thought, self-consistency, and prompt graph. Performance was assessed using task-appropriate metrics, with consensus rank aggregation and linear mixed-effects models quantifying variance. Top-ranked models achieved macro-average scores close to inter-rater agreement across tasks. Small-to-medium general-purpose models performed comparably to large models, while tiny and specialised models performed worse. Prompt graph and few-shot prompting improved performance by ~13%. Task-specific factors, including variable complexity and annotation variability, influenced results more than model size or prompting strategy. These findings show that open-weight LLMs can extract structured data from clinical reports across diseases, languages, and institutions, offering a scalable approach for clinical data curation.


Knowledge Elicitation with Large Language Models for Interpretable Cancer Stage Identification from Pathology Reports

Lee, Yeawon, Yang, Christopher C., Chang, Chia-Hsuan, Lu-Yao, Grace

arXiv.org Artificial Intelligence

Cancer staging is critical for patient prognosis and treatment planning, yet extracting pathologic TNM staging from unstructured pathology reports poses a persistent challenge. Existing natural language processing (NLP) and machine learning (ML) strategies often depend on large annotated datasets, limiting their scalability and adaptability. In this study, we introduce two Knowledge Elicitation methods designed to overcome these limitations by enabling large language models (LLMs) to induce and apply domain-specific rules for cancer staging. The first, Knowledge Elicitation with Long-Term Memory (KEwLTM), uses an iterative prompting strategy to derive staging rules directly from unannotated pathology reports, without requiring ground-truth labels. The second, Knowledge Elicitation with Retrieval-Augmented Generation (KEwRAG), employs a variation of RAG where rules are pre-extracted from relevant guidelines in a single step and then applied, enhancing interpretability and avoiding repeated retrieval overhead. We leverage the ability of LLMs to apply broad knowledge learned during pre-training to new tasks. Using breast cancer pathology reports from the TCGA dataset, we evaluate their performance in identifying T and N stages, comparing them against various baseline approaches on two open-source LLMs. Our results indicate that KEwLTM outperforms KEwRAG when Zero-Shot Chain-of-Thought (ZSCOT) inference is effective, whereas KEwRAG achieves better performance when ZSCOT inference is less effective. Both methods offer transparent, interpretable interfaces by making the induced rules explicit. These findings highlight the promise of our Knowledge Elicitation methods as scalable, high-performing solutions for automated cancer staging with enhanced interpretability, particularly in clinical settings with limited annotated data.


Evaluating Foundation Models with Pathological Concept Learning for Kidney Cancer

Gao, Shangqi, Wang, Sihan, Gao, Yibo, Wang, Boming, Zhuang, Xiahai, Warren, Anne, Stewart, Grant, Jones, James, Crispin-Ortuzar, Mireia

arXiv.org Artificial Intelligence

To evaluate the translational capabilities of foundation models, we develop a pathological concept learning approach focused on kidney cancer. By leveraging TNM staging guidelines and pathology reports, we build comprehensive pathological concepts for kidney cancer. Then, we extract deep features from whole slide images using foundation models, construct pathological graphs to capture spatial correlations, and trained graph neural networks to identify these concepts. Finally, we demonstrate the effectiveness of this approach in kidney cancer survival analysis, highlighting its explainability and fairness in identifying low- and high-risk patients. The source code has been released by https://github.com/shangqigao/RadioPath.


Small or Large? Zero-Shot or Finetuned? Guiding Language Model Choice for Specialized Applications in Healthcare

Gondara, Lovedeep, Simkin, Jonathan, Sayle, Graham, Devji, Shebnum, Arbour, Gregory, Ng, Raymond

arXiv.org Artificial Intelligence

This study aims to guide language model selection by investigating: 1) the necessity of finetuning versus zero-shot usage, 2) the benefits of domain-adjacent versus generic pretrained models, 3) the value of further domain-specific pretraining, and 4) the continued relevance of Small Language Models (SLMs) compared to Large Language Models (LLMs) for specific tasks. Using electronic pathology reports from the British Columbia Cancer Registry (BCCR), three classification scenarios with varying difficulty and data size are evaluated. Models include various SLMs and an LLM. SLMs are evaluated both zero-shot and finetuned; the LLM is evaluated zero-shot only. Finetuning significantly improved SLM performance across all scenarios compared to their zero-shot results. The zero-shot LLM outperformed zero-shot SLMs but was consistently outperformed by finetuned SLMs. Domain-adjacent SLMs generally performed better than the generic SLM after finetuning, especially on harder tasks. Further domain-specific pretraining yielded modest gains on easier tasks but significant improvements on the complex, data-scarce task. The results highlight the critical role of finetuning for SLMs in specialized domains, enabling them to surpass zero-shot LLM performance on targeted classification tasks. Pretraining on domain-adjacent or domain-specific data provides further advantages, particularly for complex problems or limited finetuning data. While LLMs offer strong zero-shot capabilities, their performance on these specific tasks did not match that of appropriately finetuned SLMs. In the era of LLMs, SLMs remain relevant and effective, offering a potentially superior performance-resource trade-off compared to LLMs.


A Robust BERT-Based Deep Learning Model for Automated Cancer Type Extraction from Unstructured Pathology Reports

Tran, Minh, Chan, Jeffery C., Huang, Min Li, Kansara, Maya, Grady, John P., Napier, Christine E., Thavaneswaran, Subotheni, Ballinger, Mandy L., Thomas, David M., Lin, Frank P.

arXiv.org Artificial Intelligence

The accurate extraction of clinical information from electronic medical records is particularly critical to clinical research but require much trained expertise and manual labor. In this study we developed a robust system for automated extraction of the specific cancer types for the purpose of supporting precision oncology research. from pathology reports using a fine-tuned RoBERTa model. This model significantly outperformed the baseline model and a Large Language Model, Mistral 7B, achieving F1_Bertscore 0.98 and overall exact match of 80.61%. This fine-tuning approach demonstrates the potential for scalability that can integrate seamlessly into the molecular tumour board process. Fine-tuning domain-specific models for precision tasks in oncology, may pave the way for more efficient and accurate clinical information extraction.


Bridging AI Innovation and Healthcare Needs: Lessons Learned from Incorporating Modern NLP at The BC Cancer Registry

Gondara, Lovedeep, Arbour, Gregory, Ng, Raymond, Simkin, Jonathan, Devji, Shebnum

arXiv.org Artificial Intelligence

Automating data extraction from clinical documents offers significant potential to improve efficiency in healthcare settings, yet deploying Natural Language Processing (NLP) solutions presents practical challenges. Drawing upon our experience implementing various NLP models for information extraction and classification tasks at the British Columbia Cancer Registry (BCCR), this paper shares key lessons learned throughout the project lifecycle. We emphasize the critical importance of defining problems based on clear business objectives rather than solely technical accuracy, adopting an iterative approach to development, and fostering deep interdisciplinary collaboration and co-design involving domain experts, end-users, and ML specialists from inception. Further insights highlight the need for pragmatic model selection (including hybrid approaches and simpler methods where appropriate), rigorous attention to data quality (representativeness, drift, annotation), robust error mitigation strategies involving human-in-the-loop validation and ongoing audits, and building organizational AI literacy. These practical considerations, generalizable beyond cancer registries, provide guidance for healthcare organizations seeking to successfully implement AI/NLP solutions to enhance data management processes and ultimately improve patient care and public health outcomes.


Can human clinical rationales improve the performance and explainability of clinical text classification models?

Metzner, Christoph, Gao, Shang, Herrmannova, Drahomira, Hanson, Heidi A.

arXiv.org Artificial Intelligence

AI-driven clinical text classification is vital for explainable automated retrieval of population-level health information. This work investigates whether human-based clinical rationales can serve as additional supervision to improve both performance and explainability of transformer-based models that automatically encode clinical documents. We analyzed 99,125 human-based clinical rationales that provide plausible explanations for primary cancer site diagnoses, using them as additional training samples alongside 128,649 electronic pathology reports to evaluate transformer-based models for extracting primary cancer sites. We also investigated sufficiency as a way to measure rationale quality for pre-selecting rationales. Our results showed that clinical rationales as additional training data can improve model performance in high-resource scenarios but produce inconsistent behavior when resources are limited. Using sufficiency as an automatic metric to preselect rationales also leads to inconsistent results. Importantly, models trained on rationales were consistently outperformed by models trained on additional reports instead. This suggests that clinical rationales don't consistently improve model performance and are outperformed by simply using more reports. Therefore, if the goal is optimizing accuracy, annotation efforts should focus on labeling more reports rather than creating rationales. However, if explainability is the priority, training models on rationale-supplemented data may help them better identify rationale-like features. We conclude that using clinical rationales as additional training data results in smaller performance improvements and only slightly better explainability (measured as average token-level rationale coverage) compared to training on additional reports.


ELM: Ensemble of Language Models for Predicting Tumor Group from Pathology Reports

Gondara, Lovedeep, Simkin, Jonathan, Devji, Shebnum, Arbour, Gregory, Ng, Raymond

arXiv.org Artificial Intelligence

Population-based cancer registries (PBCRs) face a significant bottleneck in manually extracting data from unstructured pathology reports, a process crucial for tasks like tumor group assignment, which can consume 900 person-hours for approximately 100,000 reports. To address this, we introduce ELM (Ensemble of Language Models), a novel ensemble-based approach leveraging both small language models (SLMs) and large language models (LLMs). ELM utilizes six fine-tuned SLMs, where three SLMs use the top part of the pathology report and three SLMs use the bottom part. This is done to maximize report coverage. ELM requires five-out-of-six agreement for a tumor group classification. Disagreements are arbitrated by an LLM with a carefully curated prompt. Our evaluation across nineteen tumor groups demonstrates ELM achieves an average precision and recall of 0.94, outperforming single-model and ensemble-without-LLM approaches. Deployed at the British Columbia Cancer Registry, ELM demonstrates how LLMs can be successfully applied in a PBCR setting to achieve state-of-the-art results and significantly enhance operational efficiencies, saving hundreds of person-hours annually.


Cancer Type, Stage and Prognosis Assessment from Pathology Reports using LLMs

Saluja, Rachit, Rosenthal, Jacob, Artzi, Yoav, Pisapia, David J., Liechty, Benjamin L., Sabuncu, Mert R.

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have shown significant promise across various natural language processing tasks. However, their application in the field of pathology, particularly for extracting meaningful insights from unstructured medical texts such as pathology reports, remains underexplored and not well quantified. In this project, we leverage state-of-the-art language models, including the GPT family, Mistral models, and the open-source Llama models, to evaluate their performance in comprehensively analyzing pathology reports. Specifically, we assess their performance in cancer type identification, AJCC stage determination, and prognosis assessment, encompassing both information extraction and higher-order reasoning tasks. Based on a detailed analysis of their performance metrics in a zero-shot setting, we developed two instruction-tuned models: Path-llama3.1-8B and Path-GPT-4o-mini-FT. These models demonstrated superior performance in zero-shot cancer type identification, staging, and prognosis assessment compared to the other models evaluated.